A Constraint-Based Editor for Linguistic Scholars
Abstract
Corpus linguistics is a branch of linguistics in which scholars analyse large collections of electronic documents. For English there are many such corpora. The earliest and best known is the Brown Corpus [1]. The Linguistic Data Consortium at the University of Pennsylvania is collecting several hundred million words of English. The Text Encoding Initiative (TEI) [2] is specifying SGML Document Type Definition standards for corpus linguistics, which have already been put to use in the 100-million-word British National Corpus (BNC) [3]. The International Corpus of English (ICE) [4], of which one of us (Meyer) leads the American effort, is collecting 1 million words of written and spoken English from each of 15 countries. To be useful for electronic analysis, documents, once transcribed into electronic form, must have conventional markup inserted to mark the boundaries of whatever linguistic constructs are to be studied. Some of this markup, such as that implied by natural-language parsing, can be inserted by parsing programs with, perhaps, a small amount of human intervention to disambiguate the parse (cf. [5]). When the transcription is of spoken language, the state of natural-language parsing is inadequate to this task, and even for written text some things do not yield to machine marking. For example, in order to support correct analyses, misspellings may be corrected by a (human) corpus editor, and the corpus must retain both the original and the ‘normalization’, both suitably marked. Therefore, perhaps after initial machine markup, a document must have markup added by a human, typically one with some linguistic training. This is a time-consuming task, prone to logical errors if done without adequate software. Our program supports the addition of ICE markup in ways that make it impossible to add or delete markup in violation of constraints defined by the markup scheme, while still permitting full editing of the text itself. The constraints are deduced from a hierarchy we have placed on the ICE markup (none is officially specified), but they are easily changed.
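The idea of hierarchy-derived constraints can be illustrated with a minimal Python sketch. It is not the program described in this abstract. The tag names, the `ALLOWED` table, and the `wrap`/`unwrap` helpers are all hypothetical, chosen only to show how a changeable nesting table can veto the addition or deletion of markup while leaving the text itself freely editable.

```python
# Hypothetical nesting table: each tag maps to the tags allowed directly
# inside it. These names are illustrative and are NOT the ICE tag set.
# Editing this table is all it takes to change the constraints.
ALLOWED = {
    "text": {"utterance"},
    "utterance": {"normalization", "unclear"},
    "normalization": {"original", "corrected"},
    "unclear": set(),
    "original": set(),
    "corrected": set(),
}


class Node:
    """A marked-up element whose children are Nodes or plain strings."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = list(children or [])


def _tags(items):
    """Tags of the element children in items (plain text is skipped)."""
    return [c.tag for c in items if isinstance(c, Node)]


def wrap(parent, start, end, tag):
    """Add markup: enclose parent.children[start:end] in a new `tag` element.

    Returns False, changing nothing, if the result would violate ALLOWED.
    """
    inner = parent.children[start:end]
    if tag not in ALLOWED[parent.tag]:
        return False
    if any(t not in ALLOWED[tag] for t in _tags(inner)):
        return False
    parent.children[start:end] = [Node(tag, inner)]
    return True


def unwrap(parent, index):
    """Delete markup: replace the element at `index` by its own children.

    Returns False, changing nothing, if the freed children would not be
    allowed directly under `parent`.
    """
    node = parent.children[index]
    if not isinstance(node, Node):
        return False
    if any(t not in ALLOWED[parent.tag] for t in _tags(node.children)):
        return False
    parent.children[index:index + 1] = node.children
    return True
```

In this sketch, plain strings are never checked, so the text can be edited without restriction. Only `wrap` and `unwrap` consult the table, and each refuses an operation that would leave an element somewhere the hierarchy does not permit.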
Journal: Electronic Publishing
Volume: 6
Publication date: 1993